Objective
The objective of the task is to analyse the pricing of car insurance premiums of Suncorp against its three competitors. The dataset consists of factors that affect the pricing of insurance premiums like vehicle details, owner / driver information, the details of other insurances on the vehicle, etc.
We will do some basic cleaning of the data and then carry out exploratory data analysis. To explore the relationship between the Insurance prices and the feature set we will create a minimal Shiny App and host it here . We will also build a simple pricing model and provide recommendations to SunCorp
Data Cleaning
As a first step, we will carry out some basic cleaning
Eliminate Duplication
- The Policy Commencement field has only 2 dates and both have same count of records.
- The sum,mean and standard deviation of PREMIUM columns also gives same values for all the insurers except for INSURER2
- This could possibly mean that the three insurers (1,3 and 4) have not done a pricing review between the 2 dates
- The average price increase for INSURER2 is around 2%
- To avoid duplication we will only consider the 1st Sep 2019 records for our analysis
|
Type
|
COMMENCEDATE
|
INSURER1_PREMIUM
|
INSURER2_PREMIUM
|
INSURER3_PREMIUM
|
INSURER4_PREMIUM
|
|
Count
|
2019-07-01
|
29752.00
|
29752.00
|
29752.00
|
29752.00
|
|
Count
|
2019-09-01
|
29752.00
|
29752.00
|
29752.00
|
29752.00
|
|
Sum
|
2019-07-01
|
7460723.80
|
8499325.77
|
8961579.87
|
6955641.12
|
|
Sum
|
2019-09-01
|
7460723.80
|
8659091.81
|
8961579.87
|
6955641.12
|
|
Mean
|
2019-07-01
|
250.76
|
285.67
|
301.21
|
233.79
|
|
Mean
|
2019-09-01
|
250.76
|
291.04
|
301.21
|
233.79
|
|
Std. Dev
|
2019-07-01
|
67.03
|
66.88
|
52.43
|
53.01
|
|
Std. Dev
|
2019-09-01
|
67.03
|
64.47
|
52.43
|
53.01
|
Other Data Manipulations
- We will convert the character columns to factors
- And also few other manipulations for readbility and to maintain consistency
Exploratory Analysis
Now lets look at the individual feautures
Numeric Features
- The three numeric ‘age’ features have an almost normal distribution
- This is expected as majority of the people drive in their mid-ages.
- The record count of vehicle age is nearly constant from 4 to 12 years, after which it starts dropping. This probably means that new cars are insured by a different providers or it could also mean that people have switched their insurance companies for their existing older cars. We would get a better idea if the volumes were provided.
- For modeling purposes, these columns could be converted to bins but for now we will leave them as is

Categorical Features
- The below charts gives the distrbution of the categorical features
- Due to the rare-level merging that we did earlier the ‘OTHERS’ category in the MODEL feature now has high frequency of records
- The LICCANCNT has a very low proportion (0.0043) of records and could ideally be ignored. But since it contains the conviction/suspension information we will retain it.

Feature Correlations
- We do not see much collinearity in the feature set as we have already done some cleaning in the previous steps
- We can see that the Owner Age and Youngest driver age is highly correlated, possibly an indication that the majority of the car owners are the sole drivers of the vehicle
- The MVINSTYPE = N has a high correlaion with MVINSURER= NONE, but we will not treat it
- Assuming the Insurance Premiums to be the target variables, we notice they have slight correlations with the feature set. We will explore this further in the feature importance section

Insurance Premium columns
In this section, we will take a deeper look at how the four insurers are pricing their products?
Frequency Distribution
- The premium prices for all the insurers are in the range of 190-390
|
|
INSURER1_PREMIUM
|
INSURER2_PREMIUM
|
INSURER3_PREMIUM
|
INSURER4_PREMIUM
|
|
|
Min. :197.1
|
Min. :192.9
|
Min. :202.0
|
Min. :194.3
|
|
|
1st Qu.:209.3
|
1st Qu.:239.8
|
1st Qu.:262.0
|
1st Qu.:206.3
|
|
|
Median :219.0
|
Median :277.4
|
Median :295.3
|
Median :211.1
|
|
|
Mean :250.8
|
Mean :291.0
|
Mean :301.2
|
Mean :233.8
|
|
|
3rd Qu.:247.9
|
3rd Qu.:367.9
|
3rd Qu.:338.9
|
3rd Qu.:225.5
|
|
|
Max. :386.4
|
Max. :385.2
|
Max. :388.2
|
Max. :385.3
|
- Below box plots shows their distributions. We can see that:
- The premium prices of Insurer 2 and 3 are higher on average
- Insurer 2 has a wider IQR, which is a good strategy as their prices would provide more/customised options to their customers. This could possibly result in higher revenue
- Insurer 3 has the highest median price
- Insurer 1 and 4 are similar in their pricing
- They both have a lower median price of around 210-220, which is much lower than their competitors.
- And they also have outliers

- Looking at histograms
- It looks like the prices of Insurer 2 and 3 are capped to an upper limit, probably as a result of some outlier treatment
- The frequency distribution of Insurer 1 and 4 is not normal which indicates they have a pricing strategy which is based on price slabs. (ie the pricing is done in steps)
- As compared to the pricing of Insurer 2 and 3, which is more dynamic and smooth

Pricing of YDAGE column
From the above analysis we can see that YDAGE is an important feature. Lets look into more details
- Below boxplots shows how the different insurers price their products for different ‘Youngest Driver’ age groups
- If we compare the distribution of Insurer 1 and 4 with that of 2 and 3, we can see that Insurer 2 and 3 have much smoother price curves
- Insurer 1 and 4 charge uniformly for different age groups
- Similar plots for other top features (not shown) also indicates that the pricing strategy of Insurer1 is not dynamic
- All the remaining features can be explored using the Shiny App hosted on shiny.io

Pricing Model
We will build a very basic pricing model for Insurer 1 using XGBoost algorithm
Process
- We will use the prices of Insurer1’s competitors as Target variable
- The model learns from the pricing strategies applied by the competitors and comes out with the prices that can be used by Insurer1
- Insurer1 could potentially use the predicted prices to conduct an A/B testing and measure the impact on its sales and revenue
Model Evaluation
- If we look at the below boxplots, we can see that the model predicts a median price which is higher than its existing price and lower than Insurer 2 and 3
- The distribution of the prices is approximately normal as seen in the histogram
Limitations
- Prior Sales and the Market Demand are the key factors used in the pricing of any products. The pricing models for vehicle insurance products also heavily rely on past accidents and claims, none of which were available for modeling
- Key model improvement strategies like feature engineering, resampling, hyperparameter tuning, model ensembling, etc were not utilised
- The pricing structure predicted by the model is better than the one currently used by Insurer1, but the model itself is nothing better than an average model and only included here as a POC (Proof of Concept)


Summary
- Insurer1 is currently having a pricing strategy in which their pricing is done by slabs. This means the customers will have less buying options to choose from. If they adopt a dynamic pricing (with a smoother pricing curve), it could help in generating incremental revenue
- The pricing of Insurer1 is majorly based on the Age columns, they could consider other available factors
- The insurer can probably conduct A/B testing using the prices predicted by basic model presented above and measure its impact on sales and revenue.